ABSTRACT

Organization and coordination of agents within large-scale, 
complex, distributed environments is one of the primary 
challenges in the field of multi-agent systems. A lot of in- 
terest has surfaced recently around self-* (self-organizing, 
self-managing, self-optimizing, self-protecting) agents. This 
paper presents polymorphic self-* agents that evolve a core 
set of roles and behavior based on environmental cues. The 
agents adapt these roles based on the changing demands of 
the environment, and are directly implementable in com- 
puter systems applications. The design combines strate- 
gies from game theory, stigmergy, and other biologically 
inspired models to address fault mitigation in large-scale, 
real-time, distributed systems. The agents are embedded 
within the individual digital signal processors of BTeV, a 
High Energy Physics experiment consisting of 2500 such pro- 
cessors. Results obtained using a SWARM simulation of the 
BTeV environment demonstrate the polymorphic character 
of the agents, and show how this design exceeds performance 
and reliability metrics obtained from comparable central- 
ized, and even traditional decentralized approaches. 

Categories and Subject Descriptors 

I.2.11 [Artificial Intelligence]: Distributed Artificial
Intelligence - multiagent systems, intelligent agents, coherence
and coordination

General Terms 

design, experimentation 
1. INTRODUCTION

In the field of multi-agent systems, considerable recent
attention has focused on investigating various architectures
and methodologies that promote effective organization and
coordination within large-scale, complex, distributed
systems [9][2]. Specifically, the interest is in developing
approaches that can be implemented within multi-agent
systems to produce some desirable emergent behavior that
coordinates individual actors in a system competing for
resources such as bandwidth, computing power, and data.

Agent methodologies that exhibit self-* (self-organizing, 
self-managing, self-optimizing, self-protecting) attributes are 
of particular value [5][16]. This paper introduces polymorphic
self-* agents that are capable of multiple roles as directed
by the environment. These agents evolve an optimum
core set of roles for which they are responsible, while still
possessing the ability to take on alternate roles as
environmental demands change. They are directly implementable
in computer systems applications.

The approach is based on stigmergy, a concept that ex- 
plains organization and coordination within social insect so- 
cieties that rely strictly on environmental cues for indirect 
communication between individuals. It is implemented on 
BTeV, a particle accelerator-based High Energy Physics ex- 
periment currently under development at Fermi National 
Accelerator Laboratory. Multiple layers of polymorphic, 
very lightweight agents (VLAs) are embedded within 2500 
Digital Signal Processors (DSPs) to handle fault mitigation 
across the system. The primary challenge is to determine the 
frequency at which VLAs should perform specific monitoring 
tasks. Results show how polymorphic self-* VLAs evolve in- 
dependently to find the optimum rate at which monitoring 
and fault mitigation tasks should occur. SWARM multi- 
agent simulation software is used to model RTES/BTeV. 

This paper is divided into four sections. First, some back- 
ground on polymorphism and stigmergy, along with the BTeV 
experiment itself is provided. A description of VLAs embed- 
ded within Level 1 of the RTES/BTeV environment is pro- 
vided, followed by an explanation of current challenges and 
other motivating factors. Section 3 then introduces poly- 
morphic self-* agents and describes the design in detail. Re- 
sults of a SWARM simulation of the RTES/BTeV environ- 
ment that implements the polymorphic self-* approach are 
then evaluated in Section 4. Finally, next steps and a
conclusion are provided.

2. BACKGROUND AND MOTIVATION 

2.1 Polymorphism and Stigmergy 

Concepts of polymorphism and stigmergy are founded in 
biology and the study of self-organization within social in- 
sects. The term polymorphism is used in describing ants 
and other social biological systems, and is defined as the 
occurrence of different forms, stages, or types in individual 
organisms or in organisms of the same species, independent 
of sexual variations [23][15]. Within an individual colony
consisting of ants with the same basic genetic wiring, two 
or more castes belonging to the same sex can be found. A 
caste here is defined as a differentiated morphological form 
with a specialized function, or at least the infrequent relict 
of such a form. The function or role that any individual ant 
takes on is dictated by cues from the environment [22].

The agents described in detail in section 3 of this paper 
adhere to this definition of polymorphism in that they are 
genetically identical, yet each evolve distinct roles that they 
play as demanded of them through changes in the environ- 
ment. 

The concept of polymorphic agents presented in this paper 
is different from other definitions of polymorphism that have 
surfaced in computer science. In object-oriented program- 
ming, polymorphism is usually associated with the ability of 
objects to override inherited class method implementations 
[12]. The term has also arisen in other subareas of computer
science, including some agent designs pQ, but generally de- 
scribes a templating based system or similar variation of the 
object-oriented model. On the other hand, techniques that 
attempt to evolve specialized agents are one of the central 
themes under investigation in the field of large-scale
multi-agent systems [21].

Stigmergy was introduced by biologist Pierre-Paul Grassé
to describe indirect communication that takes place between
individuals in social insect societies [10]. The theory explains
how organization and coordination of the building of
termite nests is mainly controlled by the nest itself, and not 
the individual termite workers involved. It views the process 
of emergent cooperation as a result of participants altering 
the environment and reacting to the environment as they 
pass through it. The canonical example of stigmergy is ants 
leaving pheromones in ways that help them find the short- 
est, safest distance to food or to build nests. Ant colony 
optimization methods alone have had a wide impact on co- 
ordination within multi-agent systems, addressing various 
adaptive network routing and load balancing problems Q][Z|- 

A stigmergic approach to fault mitigation is introduced 
in this paper. Individual agents communicate indirectly 
through errors that they find (or do not find) in the environ- 
ment. This indirect communication is manifested through 
actions that each agent takes as cued by the environment. 
Results show how the local actions of agents allow self-* 
global behavior to emerge. 

2.2 RTES/BTeV 

BTeV is a proposed particle accelerator-based HEP exper- 
iment currently under development at Fermi National Ac- 
celerator Laboratory. The goal is to study charge-parity vio- 
lation, mixing, and rare decays of particles known as beauty 



and charm hadrons, in order to learn more about
matter-antimatter asymmetries that exist in the universe today [14].

The experiment uses approximately 30 planar silicon pixel 
detectors that are connected to specialized field-programmable 
gate arrays (FPGAs). The FPGAs are connected to ap- 
proximately 2500 digital signal processors (DSPs) that filter 
incoming data at the extremely high rate of approximately 
1.5 Terabytes per second from a total of $20 \times 10^{6}$ data channels.
A three-tier hierarchical trigger architecture will be
used to handle this high rate [14]. An overview of the BTeV
triggering and data acquisition system is shown in Figure 1,
including a magnified view of the L1 Vertex Trigger responsible
for Level 1 filtering consisting of 2500 Worker nodes
(2000 Track Farms and 500 Vertex Farms).

There are many Worker level tasks that the Farmlet VLA 
(FVLA) is responsible for monitoring. A traditional hierar- 
chical approach would assign one (or more) distinct DSPs 
the role of the FVLA, with the responsibility of monitoring 
the state of other Worker DSPs on the node [5]. However,
this leaves the system with only very few possible points of 
failure before critical tasks are left unattended. 

Another approach would be to assign a single redundant 
DSP (or more) to each and every Worker DSP, to act as 
the FVLA. However, since 2500 Worker DSPs are projected,
this would prove expensive and may still not fully
protect all DSPs given even a low number of system failures. 
The events that pass the full set of physics algorithm filters 
occur infrequently, and the cost of operating this environ- 
ment is high. The extremely large streams of data resulting 
from the BTeV environment must be processed in real time
with highly resilient, adaptive, fault-tolerant systems.

2.3 Very Lightweight Agents (VLAs) 

Multiple levels of very lightweight agents (VLAs) [19] are
one of the primary components responsible for fault mitiga- 
tion across BTeV. 

The primary objective of the VLA is to provide the BTeV 
environment with a lightweight, adaptive layer of fault mit- 
igation. One of the latest phases of work at Syracuse Uni- 
versity has involved implementing embedded proactive and 
reactive rules to handle specific system failure scenarios. 

A scaled prototype of the Level 1 RTES/BTeV environ- 
ment was presented at the SuperComputing 2003 (SC2003) 
conference [18]. Reactive and proactive VLA rules were integrated
within this Level 1 prototype and served a primary
role in demonstrating the embedded fault-tolerant capabilities
of the system.

2.4 Challenges 

While the SC2003 prototype was effective for demonstrat- 
ing the real-time fault mitigation capabilities of VLAs on 
limited hardware utilizing 16 DSPs, one of the major chal- 
lenges is to find out how the behavior of the various lev- 
els of VLAs will scale when implemented across the 2500 
DSPs projected for BTeV [13]. In particular, how frequently
should these monitoring tasks be performed to optimize processing
time, and what effect does this have on other components
and the overall behavior of a large-scale real-time
embedded system such as BTeV?

Given the number of components and countless fault sce- 
narios involved, it is infeasible to design an 'expert system' 
that applies mitigative actions triggered from a central pro- 
cessing unit acting on rules capturing every possible system 
state. Instead, a distributed approach using self-organizing
VLAs accomplishes fault mitigation within the large-scale
real-time RTES/BTeV environment.

Figure 1: The BTeV triggering and data acquisition system showing (left side) detector, buffer memories,
L1, L2, L3 clusters and their interconnects and (right side) a magnified figure of the L1 Vertex trigger.

2.5 SWARM 

SWARM (http://www.swarm.org), distributed under the 
GNU General Public License, is software available as a Java 
or Objective-C development kit that allows for the multi-agent
simulation of complex systems [3]. It consists of a
set of libraries that facilitate implementation of agent-based 
models. SWARM has previously been used by the RTES 
team in simulations that model the RTES/BTeV
environment [17].

3. POLYMORPHIC SELF-* AGENTS 
3.1 Overview 

This paper introduces a stigmergic multi-agent systems 
approach that uses polymorphic self-* agents to address the 
weaknesses inherent in traditional hierarchical fault mitiga- 
tion designs. Rather than hard-wiring the assignment of 
FVLA roles to specific VLAs embedded within individual 
DSPs, VLAs are made polymorphic so that every VLA is 
equipped to play the role of FVLA for any DSP on the 
same node. 

Since the FVLA is responsible for a wide range of moni- 
toring tasks, this means that we must build the capability 
of performing each task into every Worker Level VLA. The 
classic problem this presents in traditional hierarchical ap- 
proaches is how to process all of the data necessary for all 
of these tasks in time for a useful response [24]. However,
since these agents are polymorphic and evolve roles gradually
over time, there is only a small set of tasks for which
each agent is responsible at any given point in time.

Stigmergy is used to determine which set of tasks any 



given VLA performs. Errors found (or not found) in the en- 
vironment by an individual VLA increase (decrease) the sen- 
sitivity of that VLA to that particular type of error. Agents 
start out by monitoring each type of error at a fixed rate. 
Then, based entirely on what is encountered in the environ- 
ment, each develops a core set of roles for which it takes 
responsibility. For example, a single VLA embedded within 
a DSP monitors each particular error at some unique rate. 
When an individual VLA performs a monitoring task on 
some DSP, it either finds an error and performs mitigative 
action, or does not find an error and does nothing. If it 
finds an error, it increases its own sensitivity to that type 
of error on the corresponding DSP. If it does not find an 
error, its sensitivity to the error decreases slightly. Results 
show how, over time, this produces an optimal distribution 
of monitoring tasks across all VLAs, with each VLA evolving 
responsibility for a unique core set of monitoring tasks. 

The overall emergent behavior of this design results in 
self-organization of FVLA responsibilities based on the state 
and workload of all DSPs within the node. A certain set of 
VLAs may perform specific FVLA tasks at one moment, and 
another set (which may or may not include VLAs from the 
original set) can be found performing these same tasks later 
in time. The organization occurs automatically within the 
system as environmental cues fluctuate. This eliminates the 
financial and efficiency costs associated with having special- 
ized FVLAs that at times sit idle as Worker DSPs operate 
at full capacity and fall behind on event processing. It also 
increases the efficiency of Worker DSPs that may be wasting 
idle time when crossing processing rates are low. In effect, a 
fully connected network of FVLAs is created that continues
to provide effective fault mitigation when exposed to a high 
volume of system failures. 

There are two key characteristics of this model. The first 
is that it requires no central management or global process- 



ing. Second, it is optimally reliable since FVLA monitoring 
tasks are distributed across all DSPs, and can be adapted 
based on changes in the environment. The next section ex- 
plains implementation details on how each individual agent 
uses only cues from the environment to determine necessary 
actions. 

3.2 Implementation 

As described above, distributed VLAs within Worker level 
DSPs are used to accomplish the fault monitoring tasks that 
the FVLA is responsible for. However, these are the same 
DSPs that are responsible for the critical overall objective 
of Level 1 physics application (PA) data filtering [14]. It
is therefore extremely important that DSP usage by each 
Worker VLA is minimal, and only occurs either when the 
PA is not fully utilizing the DSP, or when critical fault mit- 
igative action is required. 

Game theory has been applied to a wide range of problems,
and is used here to coordinate the number of DSP
clock cycles allocated between the PA and the VLA.
Both the PA and VLA wish to maximize the number of 
clock cycles during which they have control. If the VLA 
takes too many DSP cycles, then the PA will be unable to 
process the incoming data at a high enough rate to prevent 
the buffers from overflowing, resulting in a loss of data con- 
tinuity. This is often fatal for the experiment since this lost 
data could very well contain portions of vital characteristics 
of the physics properties being evaluated. If, on the other
hand, the PA takes too many DSP cycles, then it runs the
risk that system faults will go undetected, resulting in accep- 
tance of corrupt data, and/or incremental bottlenecks that 
again cause buffer overflows. 

An efficient adaptive scheduling algorithm is required that 
will effectively establish scheduling priorities between the 
PA and VLA. Mandatory costs associated with the Ker- 
nel/Command Processor, including clock cycle costs for con- 
text switching must be factored in. An analysis of the worst- 
case behavior of tasks (both VLA and PA) can be done to 
determine the amount of time that must be allotted to each 
process. However, there must be a way for the system to 
adaptively modify these values when environmental condi- 
tions change. That is, if during every interval T, the HEP 
applications and the operating system use Tpa and Tos 
time units, respectively, then the VLA will be allowed to 
use T - Tpa - Tos every T time units [19].
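
As an illustration only, the time budget described above can be sketched in a few lines of Python. The function name and the clamping of a negative remainder to zero are assumptions made for this sketch; they are not taken from the RTES/BTeV design.

```python
def vla_time_budget(T: float, t_pa: float, t_os: float) -> float:
    """Return the time units left for the VLA in one interval of length T.

    T    -- length of the scheduling interval
    t_pa -- time used by the physics application (Tpa)
    t_os -- time used by the operating system (Tos)
    """
    if T <= 0 or t_pa < 0 or t_os < 0:
        raise ValueError("T must be positive and usages non-negative")
    # Assumption: if the PA and OS together exceed T, the VLA gets nothing
    # in this interval rather than a negative budget.
    return max(0.0, T - t_pa - t_os)
```

For example, with T = 100, Tpa = 80, and Tos = 5, the VLA would be allowed 15 time units in that interval.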

An analysis of best-case behavior of tasks (VLA and PA) 
requires the use of a utility value in order for each DSP 
to determine locally precisely when the PA or VLA should 
relinquish control [20]. A reward system based on a combination
of the amount of data processed, along with the
frequency of VLA maintenance checks, is used by each DSP
for each error in calculating the following local utility value:

DSP Utility Value = $D w^{-1} + c F^{-1}$, where

D = Expected amount of data that DSP could process
    during a given time interval (T).
w = Current data buffer watermark.
F = Total number of clock cycles elapsed since last
    FVLA check on neighboring DSPs.
c = Adaptive constant representing weight to place on
    FVLA checks.



Since the amount of data that any single DSP can process
(D) over a given time interval is mostly fixed, the utility
value essentially involves summing the inverse of the current
data buffer watermark ($w^{-1}$) with a weighted value for the
inverse of the time elapsed since individual FVLA tasks were
last performed ($F^{-1}$).
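
A minimal Python sketch of this utility value is given below for concreteness. The function name and the rejection of non-positive w and F are assumptions of the sketch, since the text does not say how a zero watermark or a zero elapsed time is handled.

```python
def dsp_utility(D: float, w: float, F: float, c: float) -> float:
    """Local DSP utility value D * w**-1 + c * F**-1.

    D -- expected amount of data the DSP could process in interval T
    w -- current data buffer watermark
    F -- clock cycles elapsed since the last FVLA check
    c -- adaptive weight placed on FVLA checks
    """
    # Assumption: w and F must be strictly positive, because both
    # appear as inverses in the formula.
    if w <= 0 or F <= 0:
        raise ValueError("w and F must be positive")
    return D / w + c / F
```

With D = 10, w = 0.5, F = 4, and c = 2, the two terms are 20 and 0.5, giving a utility value of 20.5.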

The task currently active (PA or VLA) calculates the op- 
timum expected utility value for the DSP at a time interval 
based on the criticality of each error. If the active process 
determines that a higher DSP utility value is received by re- 
maining active, then the active task will continue. However, 
if a higher utility value can be gained by passing control 
to the currently inactive process, then that is what it does.
For example, if the PA is currently active, the input data 
buffer for a given DSP is low, and FVLA monitoring respon- 
sibilities for a specific error have not been performed on a 
particular DSP in a long time, then the VLA task will be 
made active. If, however, the VLA were currently active under
these conditions, then the VLA would simply maintain con- 
trol for another T time steps, at which time corresponding 
utility values would again be calculated. This is equivalent 
to determining:

$\max\left(w,\; 2\left(\frac{1}{1 + e^{-dF}} - 0.5\right)\right)$

the maximum value of either w or 2 x ((sigmoid function
value for F) - .5). Here, $2\left(\frac{1}{1 + e^{-dF}} - 0.5\right)$ is
an adjusted sigmoid function for F which represents F as a
weighted value between 0 and 1.

It is important to note here that the value assigned to d 
determines the steepness of the sigmoid function, and hence 
the sensitivity of the agent to a given error. In other words, 
the higher the value of d, the higher the adjusted sigmoid 
value of F, and the higher the sensitivity (the frequency of 
checks) of the VLA to a particular error. 
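
The adjusted sigmoid can be written directly from the expression above. The following Python sketch is illustrative only; the function name is an assumption, and F and d are taken to be non-negative.

```python
import math


def adjusted_sigmoid(F: float, d: float) -> float:
    """Adjusted sigmoid 2 * (1 / (1 + exp(-d * F)) - 0.5).

    F -- clock cycles elapsed since the last FVLA check (F >= 0)
    d -- steepness, i.e. the sensitivity of the agent to this error (d >= 0)
    """
    if F < 0 or d < 0:
        raise ValueError("F and d must be non-negative")
    # For non-negative F and d the result lies in [0, 1): it is 0 when
    # d * F = 0 and approaches 1 as d * F grows.
    return 2.0 * (1.0 / (1.0 + math.exp(-d * F)) - 0.5)
```

For a fixed F, a larger d gives a larger value. With F = 100, d = .01 yields about .46, while d = .02 yields about .76, so the more sensitive agent is closer to taking control.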

This is where the polymorphic behavior of the VLA is in- 
troduced. Any time that an individual VLA finds a specific 
error while performing FVLA monitoring tasks, the d value 
for that error on that particular node is increased. Any time 
that an individual VLA performs a monitoring task and does 
not find an error, the d value is slightly decreased. 
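
One possible form of this update is sketched below in Python. The text states only that d is increased when an error is found and slightly decreased otherwise, so the additive form, the step sizes (gain, decay), and the bounds (d_min, d_max) are all hypothetical values chosen for illustration.

```python
def update_sensitivity(d: float, error_found: bool,
                       gain: float = 0.005, decay: float = 0.0001,
                       d_min: float = 0.0001, d_max: float = 1.0) -> float:
    """Return the new d value after one monitoring task.

    All numeric defaults are hypothetical; the paper does not give them.
    """
    if error_found:
        # Error found: raise sensitivity to this error on this DSP.
        d = d + gain
    else:
        # No error found: lower sensitivity slightly (decay < gain).
        d = d - decay
    # Assumption: keep d inside fixed bounds so the agent never becomes
    # completely insensitive to an error type.
    return min(d_max, max(d_min, d))
```

The asymmetry between gain and decay reflects the wording of the text, in which a found error has a stronger effect than a single clean check.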

A high value for F means that FVLA tasks are performed 
more frequently (high sensitivity), whereas a low value for F 
means they are performed less often (low sensitivity). The 
PA is passed (or maintains) control if w is higher than this 
adjusted sigmoid function value for F, otherwise the VLA is 
passed (maintains) control. For example, if the PA is cur- 
rently active, the input data buffer watermark for a given 
DSP is about half full (w=.5), and FVLA functions have re- 
cently been performed (the adjusted sigmoid function value 
for F is, say, .15) then the PA will remain active. 
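
The control decision described in this section reduces to a single comparison, sketched here in Python. The function name and the string labels are assumptions of the sketch; the second argument is the adjusted sigmoid value for F, computed separately.

```python
def next_active_task(w: float, sigmoid_F: float) -> str:
    """Decide which task controls the DSP for the next T time steps.

    w         -- data buffer watermark, between 0.0 and 1.0
    sigmoid_F -- adjusted sigmoid value for F, between 0.0 and 1.0
    """
    # The PA is passed (or keeps) control only if w is strictly higher;
    # otherwise the VLA is passed (or keeps) control.
    return "PA" if w > sigmoid_F else "VLA"
```

Applied to the example above (w = .5 and an adjusted sigmoid value of .15), the function returns "PA", so the PA remains active.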

4. RESULTS 

SWARM simulates Farmlet data buffer queues that are 
populated at a rate consistent with the behavior of the in- 
coming physics crossing data. Each DSP within a given 
Farmlet processes a fixed amount of data at each discrete 
time step. Three distinct types of errors are introduced ran- 
domly within each Worker DSP at a variable rate using a 
Multiply With Carry (RWC8gen) random number genera- 
tor with a fixed seed. Any time a software or hardware 
error is encountered within the simulation, the processing 
rate for that DSP decreases a set amount depending on the 
type of error. The error is cleared when any DSP within
the same Farmlet performs FVLA checks against the DSP
for the error type present. However, there is a time cost
associated with performing these checks. As detailed in the
section above describing the self-organizing model, the DSP
must decide whether or not it is worth taking time to perform
FVLA monitoring tasks against neighboring DSPs. If
checks are performed too frequently, then the time available
for data crossing processing is limited. On the other hand, if
they are not performed frequently enough, then the chance
that other DSPs within the same Farmlet are experiencing
errors is high. As described, a high error rate will also lead
to slow processing rates.

Figure 2: The VLA d-value (sensitivity) for 3 distinct error types (e1, e2, e3) being monitored on DSP1.
Each of the 5 graphs represents the d-value adapted over time by one of the remaining 5 DSPs (DSP2 -
DSP6) on the same Farmlet. The simulation fluctuated the error rate between a moderate rate ($5 \times 10^{-4}$)
for the first 35000 time steps, a low rate ($5 \times 10^{-6}$) for the next 35000 time steps (35001 - 70000), and a high
rate ($5 \times 10^{-3}$) for the last 30000 time steps (70001 - 100000).

The formula designed for these experiments calculates the 
frequency of performing FVLA tasks for neighboring DSPs 
as a sigmoid function adjusted to a value between 0.0 and 
1.0. The fullness of the crossing data buffer queue is also 
a value between 0.0 and 1.0 representing the data water- 
mark percentage. These two values are weighed against each 
other, and the DSP makes a decision on where to devote its 
energy as described in detail in the last section. 

The decision of whether the VLA or PA has control of the 
DSP is made by each DSP at each time step in the SWARM 
simulation. In this way, the monitoring tasks required by the 



environment are always met, but not necessarily by one (or 
a few) designated DSPs. Instead, these tasks are performed 
by any polymorphic DSP within the Farmlet as dictated by 
the changing needs of the environment. 

The DSPs themselves self-organize as different DSPs within 
the Farmlet take on the necessary monitoring tasks at dif- 
ferent points in time as required by the environment. If a 
DSP performs FVLA monitoring tasks for a given type of 
error on a neighboring DSP, it will either determine that the 
error is not present, or it will find the error and perform the 
designated mitigative actions. In the case where an error 
is found, the d-value for that particular error on the spe- 
cific DSP is increased. As described in detail earlier, this 
essentially increases the sensitivity of the VLA for this type 
of error. On the other hand, if no error is found, then the 
d-value (sensitivity) is slightly decreased. 

As detailed next, Figure 2 shows how the local action performed
by each VLA over a short period of time results in
VLAs evolving responsibility for a core set of fault monitoring
tasks. Over the 100000 time steps for which the SWARM
simulation is run, the 5 VLAs (1 per DSP) can be seen taking
on distinct roles that lead to an efficient global fault
mitigation strategy for monitoring errors on DSP1. These
roles are evolved using local information only, and rely on
stigmergy within the environment for indirect coordination
with other VLAs.

Figure 3: Average number of crossings processed per DSP
resulting from the stigmergic approach using polymorphic
agents (adaptive), compared against the same simulation using
a fixed monitoring rate (d-value fixed at .01).

The simulation fluctuates the error rate at various intervals
in order to demonstrate the effect that changes in error rate
can have on polymorphic behavior. A moderate error rate
($5 \times 10^{-4}$) is used for the first 35000 time steps, a low error rate
($5 \times 10^{-6}$) for the next 35000 time steps (35001-70000), and
the last 30000 time steps (70001-100000) use a high rate
($5 \times 10^{-3}$). Figure 2 shows how all of the VLAs are able to adjust
sensitivity to errors on DSP1 based on these fluctuating er- 
ror rates over time. For example, the d-value (sensitivity) to 
individual errors on DSP1 for all 5 VLAs (embedded within 
DSP2 - DSP6) can be seen dropping beginning around time 
step 35000, and then increasing dramatically again at time 
step 70000 in reaction to the significant increase in error 
rate. 
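
The error-rate schedule used in the simulation can be expressed as a simple piecewise function. The Python sketch below assumes that time steps are numbered from 1 to 100000, as the quoted ranges suggest; the function name is hypothetical, and this is not the SWARM code used for the experiments.

```python
def error_rate(t: int) -> float:
    """Per-time-step error rate used at simulation time step t."""
    # Assumption: time steps are 1-indexed and the run lasts 100000 steps.
    if not 1 <= t <= 100000:
        raise ValueError("t must be between 1 and 100000")
    if t <= 35000:
        return 5e-4   # moderate rate, first 35000 time steps
    if t <= 70000:
        return 5e-6   # low rate, time steps 35001-70000
    return 5e-3       # high rate, time steps 70001-100000
```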

Polymorphism is demonstrated clearly in Figure 2, which
displays the VLA d-value (sensitivity) for 3 distinct error
types being monitored on DSP1 within a single Farmlet.
The d-values evolved by each of the VLAs within the 5 DSPs
(DSP2-DSP6) monitoring DSP1 within the same Farmlet
are shown. When the error rate is high (from time steps
70000-100000), the VLAs embedded within DSP3 and DSP6
develop a high sensitivity for error type 1 (e1), while the
sensitivity for e1 of the VLAs in the remaining DSPs remains
low. Similarly, the VLAs on DSP2 and DSP5 have a high 
sensitivity for error type 2 (e2), and VLAs for DSP2 and 
DSP3 are highly sensitive to e3. 

The moderate error rate used for the first 35000 time 
steps reveals additional polymorphic characteristics of this 
approach. Here, the error rate is not quite high enough for 
any single VLA to evolve long term responsibility for an 
individual error type on DSP1. Instead, 1 or 2 VLAs can 
be seen monitoring a single error type at one moment, and 
then a separate VLA (or group of VLAs) can be seen mon- 
itoring the same error type a short time later. This is due 
to the fact that the error rate is too low to stimulate high 
sensitivity in a single VLA. Sensitivity for the error type 
drops to a level comparable with other available VLAs on 
the Farmlet. For example, the VLAs on DSP3 and DSP4
develop a modest level of sensitivity for e1 early on (time
steps 0-15000), but the role is taken over by VLAs on DSP5
(time steps 15000-28000) and later DSP6 (28000-35000).

Figure 3 shows the average data processing rate per DSP
for the stigmergic approach using polymorphic agents, as
compared to the same simulation using a fixed monitoring 
rate (d-value fixed at .01) for each agent. The polymorphic 
agents in the stigmergic approach adapt an optimum mon- 
itoring rate for each error based strictly on the demands of 
the environment at any given time. This results in a higher 
number of crossings processed since, as described in detail 
earlier, less time is wasted performing needless monitoring 
tasks or missing critical errors. 

5. NEXT STEPS 

The next phase of this project will expand the number of 
different types of errors handled, along with the amount of 
fluctuation in error rates. It will also focus further on how 
sensitivity (d-value) is adapted for each VLA. Currently, a 
rudimentary method is used that slightly increases (or decreases)
sensitivity based on the presence (or absence) of an
error. Other variables could be considered in determining 
the amount of change to apply, such as factoring in the sever- 
ity level of the error, or looking at the consequences of other 
recently taken actions. An enhanced evaluation methodol- 
ogy to better demonstrate the performance advantage of this 
approach as compared to other traditional methodologies is 
also necessary. 

Another issue being investigated is how to handle com- 
munication between agents when one agent has informa- 
tion that may be relevant to other agents, but it does not 
know to which other agent the information is relevant. This 
is a problem encountered in many large-scale multi-agent 
systems [21], and is especially an issue in fault mitigation
where trends in information received across agents can pro- 
vide valuable warning signs. 

At the same time, another scaled prototype of the actual 
projected RTES/BTeV software and hardware environment 
based on the SC2003 demonstration system is also being 
developed, and will integrate the VLA self-* model. This 
prototype will be presented at the 2nd Workshop on
High-Performance Fault-Adaptive Large-Scale Embedded
Real-Time Systems (FALSE-II) in the IEEE Real-Time and
Embedded Technology and Applications Symposium (RTAS05).



6. CONCLUSION 

This paper has described a fully distributed stigmergic 
approach to fault mitigation in large-scale real-time sys- 
tems using lightweight, polymorphic, self-* agents embed- 
ded within individual DSPs. Stigmergy facilitates indirect 
communication and coordination between agents using cues 
from the environment, and concepts from game theory and 
polymorphism allow each individual agent to evolve a core set
of roles for which it is responsible. Agents adapt these roles
as environmental demands change. The approach is imple- 
mented on a SWARM simulation of BTeV, a High Energy 
Physics experiment consisting of 2500 DSPs. 

Results demonstrate the polymorphic nature of the agents, 
and display the performance and reliability advantages of 
this approach. The next phase of this project will increase 
the number of possible error types, and add more fluctuation 



to individual error rates. More sophisticated ways of adapt- 
ing error sensitivity among agents will also be investigated, 
along with more elaborate performance evaluation metrics. 